CASE STUDY - 1 :: Healthcare Provider Fraud Detection




1. This notebook contains an extensive analysis of the { BENE + IP + OP + Fraud Provider Labels } data from the publicly available Kaggle dataset, with the intent:

2. I have also added some new features to bring more business aspects into the dataset, and analyzed their impact on potential fraud.

Kindly check out the Deck below for a better understanding of the BUSINESS-oriented insights into this problem:

Kindly check out the Doc below for the TECHNICAL design description of this problem:

Notebook Contents

Downloading_Train_Data_Files

TRAIN set files

Importing_Libraries

Importing_Dataset

Exploring_Target_Labels_Data

Adding the Admitted or Not Admitted indicator in IP and OP Dataset

Merging the Datasets

Merging the IP_OP Dataset with BENE Data

Merging the IP_OP_BENE Dataset with PROVIDER level Tgt Labels Data

Entire Dataset

ASSUMPTION :: One provider may be involved in more than one claim. So, are all the claims filed by a potentially fraudulent provider fraudulent?

Therefore, it is a big assumption to make that all the claims filed by a potentially fraudulent provider are fraudulent.

Feature Engineering + Impact Analysis

Adding New Feature :: Is_Alive?

Adding New Feature :: Claim_Duration

Adding New Feature :: Admitted_Duration

Adding New Feature :: Bene_Age

Does InscClaimAmtReimbursed influence Potential Fraud?

Does IPAnnualReimbursementAmt influence Potential Fraud?

Why do we have IP Annual Re-Imb Amount as 0 for Admitted Patients?

Does OPAnnualReimbursementAmt influence Potential Fraud?

Why do we have OP Annual Re-Imb Amount as 0 for Admitted Patients?

Adding New Feature :: Total Number of false claims filed by a Provider

Adding New Feature :: Total Number of claims or cases seen by Attending Physician

Adding New Feature :: Total Number of claims or cases seen by Operating Physician

Adding New Feature :: Total Number of claims or cases seen by Other Physician

Adding Combined Feature :: Att_Opr_Oth_Phy_Tot_Claims

Adding 3 New Features :: Prv_Tot_Att_Phy, Prv_Tot_Opr_Phy and Prv_Tot_Oth_Phy

Adding Combined Feature :: Prv_Tot_Att_Opr_Oth_Phys

Adding New Feature :: Total Unique Claim Admit Codes used by a PROVIDER

Adding New Feature :: Total Unique Number of Diagnosis Group Codes used by a PROVIDER

Adding New Feature :: Total unique birth years of beneficiaries treated by a Provider

Adding New Feature :: Sum of ages of patients treated by a Provider

Adding New Feature :: Sum of Insc Claim Re-Imb Amount for a Provider

Adding New Feature :: Total number of RKD Patients seen by a Provider

Exploratory Data Analysis

Q1. Which are the Top-25 Providers with maximum number of fraudulent cases?

Q2. Which are the Top-25 Providers with maximum number of non-fraudulent cases?

Q3. Which are the Top-25 Attending Physicians with maximum number of fraudulent cases?

Q4. Which are the Top-25 Attending Physicians with maximum number of non-fraudulent cases?

Q5. Which are the Top-25 Operating Physicians with maximum number of fraudulent cases?

Q6. Which are the Top-25 Operating Physicians with maximum number of non-fraudulent cases?

Q7. Which are the Top-25 Other Physicians with maximum number of fraudulent cases?

Q8. Which are the Top-25 Other Physicians with maximum number of non-fraudulent cases?

Q9. Which are the Top-25 ClmAdmitDiagnosisCode with maximum number of fraudulent cases?

Q10. Which are the Top-25 ClmAdmitDiagnosisCode with maximum number of non-fraudulent cases?

Q11. Which are the Top-25 DiagnosisGroupCode with maximum number of fraudulent cases?

Q12. Which are the Top-25 DiagnosisGroupCode with maximum number of non-fraudulent cases?

Q13. Do Age_groups have any relationship with maximum number of fraudulent cases?

Q14. Do Age_groups have any relationship with maximum number of non-fraudulent cases?

Q15. Which are the Top-25 States with maximum number of fraudulent cases?

Q16. What are the Top-25 States with maximum number of non-fraudulent cases?

Q17. Which are the Top-25 Counties with maximum number of fraudulent cases?

Q18. Which are the Top-25 Counties with maximum number of non-fraudulent cases?

Q19. Do various Human Races have any relationship with maximum number of fraudulent cases?

Q20. Do various Human Races have any relationship with maximum number of non-fraudulent cases?

Downloading_Train_Data_Files

Importing_Libraries

Importing_Dataset

Exploring_Target_Labels_Data

OBSERVATION

Adding the Admitted or Not Admitted indicator in IP and OP Dataset

Merging the Datasets

Merging the IP_OP Dataset with BENE Data
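The admitted indicator and the two merge steps can be sketched in pandas as below. This is only an illustration on toy rows: the column names `BeneID`, `ClaimID`, `AdmissionDt` and `Admitted?`, the use of `pd.concat` to stack inpatient and outpatient claims, and the inner join on `BeneID` are all assumptions, not necessarily what the notebook cells do.

```python
import pandas as pd

# Toy inpatient (IP) and outpatient (OP) claims; column names are assumed.
ip = pd.DataFrame({"BeneID": ["B1"], "ClaimID": ["C1"],
                   "AdmissionDt": ["2009-02-01"]})
op = pd.DataFrame({"BeneID": ["B2", "B1"], "ClaimID": ["C2", "C3"]})

# Hypothetical indicator: 1 for inpatient (admitted), 0 for outpatient.
ip["Admitted?"] = 1
op["Admitted?"] = 0

# Stack both claim tables; IP-only columns become NaN on OP rows.
ip_op = pd.concat([ip, op], ignore_index=True)

# Attach beneficiary attributes to every claim via the shared BeneID key.
bene = pd.DataFrame({"BeneID": ["B1", "B2"],
                     "DOB": ["1940-01-01", "1950-12-01"]})
ip_op_bene = ip_op.merge(bene, on="BeneID", how="inner")
print(ip_op_bene)
```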

Merging the IP_OP_BENE Dataset with PROVIDER level Tgt Labels Data

Entire Dataset

ASSUMPTION :: One provider may be involved in more than one claim. So, are all the claims filed by a potentially fraudulent provider fraudulent?

- This cannot hold true for every provider: if a provider has filed, say, 50 claims, we cannot say that all of them are fraudulent.
    - There may be a pattern in which only 1 or 2 of a provider's 50 claims are fraudulent.

Therefore, it is a big assumption to make that all the claims filed by a potentially fraudulent provider are fraudulent.
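To see why this assumption is baked into the merged data, here is a small sketch of joining provider-level labels onto claim-level rows. The column names `Provider` and `PotentialFraud` and the Yes/No label values are assumed; the rows are toy data.

```python
import pandas as pd

# Claim-level rows: provider P1 filed two claims, P2 filed one.
claims = pd.DataFrame({"ClaimID": ["C1", "C2", "C3"],
                       "Provider": ["P1", "P1", "P2"]})

# Provider-level target labels (one row per provider).
labels = pd.DataFrame({"Provider": ["P1", "P2"],
                       "PotentialFraud": ["Yes", "No"]})

# A left join copies the provider's label onto each of its claims,
# so every claim of a flagged provider ends up marked "Yes".
merged = claims.merge(labels, on="Provider", how="left")
print(merged)
```

Both of P1's claims inherit the "Yes" label even though nothing in the data says which individual claim, if any, was fraudulent.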

OBSERVATION

OBSERVATION

Feature Engineering + Impact Analysis

Let's create some features

Adding New Feature :: Is_Alive?

- Is Alive? = Yes if DOD is NaN (no date of death recorded) else No
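A minimal pandas sketch of this flag, assuming a missing DOD means no death was recorded and the beneficiary is therefore treated as alive. The `BeneID` column and the toy rows are illustrative only.

```python
import pandas as pd

# Toy beneficiary rows; DOD is NaT when no date of death is recorded.
bene = pd.DataFrame({
    "BeneID": ["B1", "B2", "B3"],
    "DOD": pd.to_datetime(["2009-06-01", None, None]),
})

# Assumption: missing DOD -> alive ("Yes"), recorded DOD -> not alive ("No").
bene["Is_Alive?"] = bene["DOD"].isna().map({True: "Yes", False: "No"})
print(bene)
```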

Adding New Feature :: Claim_Duration

- Claim Duration = Claim End Date - Claim Start Date
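The subtraction above could look like this in pandas. The column names `ClaimStartDt` and `ClaimEndDt` and the choice of whole days as the unit are assumptions.

```python
import pandas as pd

claims = pd.DataFrame({
    "ClaimID": ["C1", "C2"],
    "ClaimStartDt": ["2009-01-10", "2009-03-01"],
    "ClaimEndDt": ["2009-01-15", "2009-03-01"],
})

# Parse the date strings so they can be subtracted.
for col in ["ClaimStartDt", "ClaimEndDt"]:
    claims[col] = pd.to_datetime(claims[col])

# Duration in whole days; a same-day claim gets 0.
claims["Claim_Duration"] = (claims["ClaimEndDt"] - claims["ClaimStartDt"]).dt.days
print(claims)
```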

OBSERVATION

OBSERVATION

OBSERVATION

OBSERVATION

OBSERVATION

Adding New Feature :: Admitted_Duration

- Admitted Duration = Discharge Date - Admission Date
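One detail this formula leaves open is what happens for outpatient claims, which have no admission or discharge date. The sketch below assumes columns named `AdmissionDt` and `DischargeDt` and, as a labeled assumption, assigns such rows a duration of 0 days.

```python
import pandas as pd

df = pd.DataFrame({
    "ClaimID": ["C1", "C2"],
    # C2 stands in for an outpatient claim with no admission dates.
    "AdmissionDt": pd.to_datetime(["2009-02-01", None]),
    "DischargeDt": pd.to_datetime(["2009-02-08", None]),
})

# NaT - NaT yields NaT, whose .dt.days is NaN; filling with 0 is an assumption.
days = (df["DischargeDt"] - df["AdmissionDt"]).dt.days
df["Admitted_Duration"] = days.fillna(0).astype(int)
print(df)
```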

OBSERVATION

OBSERVATION

OBSERVATION

OBSERVATION

OBSERVATION

Adding New Feature :: Bene_Age

- Bene Age = DOD - DOB (if DOD is Null then replace it with MAX date in DOD)
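A rough sketch of this rule on toy rows. Expressing the age in whole years by dividing the day count by 365.25 is an assumption; the note above does not state the unit.

```python
import pandas as pd

bene = pd.DataFrame({
    "BeneID": ["B1", "B2"],
    "DOB": pd.to_datetime(["1940-01-01", "1950-12-01"]),
    "DOD": pd.to_datetime(["2009-06-01", None]),
})

# Where DOD is missing, fall back to the latest DOD seen in the data.
reference_date = bene["DOD"].fillna(bene["DOD"].max())

# Age in whole years (365.25 days per year is an approximation).
bene["Bene_Age"] = ((reference_date - bene["DOB"]).dt.days / 365.25).astype(int)
print(bene)
```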

OBSERVATION

OBSERVATION

OBSERVATION

OBSERVATION

OBSERVATION

Does InscClaimAmtReimbursed influence Potential Fraud?

OBSERVATION

OBSERVATION

OBSERVATION

OBSERVATION

OBSERVATION

Does IPAnnualReimbursementAmt influence Potential Fraud?

OBSERVATION

Why do we have IP Annual Re-Imb Amount as 0 for Admitted Patients?

OBSERVATION

OBSERVATION

OBSERVATION

OBSERVATION

OBSERVATION

Does OPAnnualReimbursementAmt influence Potential Fraud?

OBSERVATION

Why do we have OP Annual Re-Imb Amount as 0 for Admitted Patients?

OBSERVATION

OBSERVATION

OBSERVATION

OBSERVATION

OBSERVATION

Adding New Feature :: Total Number of false claims filed by a Provider

- Logic :: COUNT(all claims submitted by a Provider) - COUNT(all non-fraud claims submitted by a Provider)
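The counting logic above might be written as follows. The feature name `Prv_Tot_False_Claims` is hypothetical, and the claim-level `PotentialFraud` column is assumed to carry the provider's label.

```python
import pandas as pd

claims = pd.DataFrame({
    "Provider": ["P1", "P1", "P1", "P2", "P2"],
    "ClaimID": ["C1", "C2", "C3", "C4", "C5"],
    "PotentialFraud": ["Yes", "Yes", "Yes", "No", "No"],
})

# COUNT(all claims submitted by a provider)
total = claims.groupby("Provider")["ClaimID"].count()

# COUNT(all non-fraud claims submitted by a provider)
non_fraud = (claims[claims["PotentialFraud"] == "No"]
             .groupby("Provider")["ClaimID"].count())

# Providers with no non-fraud claims are missing from non_fraud; treat as 0.
false_claims = total.sub(non_fraud, fill_value=0).astype(int)

# Broadcast the provider-level count back onto each claim row.
claims["Prv_Tot_False_Claims"] = claims["Provider"].map(false_claims)
print(claims)
```

Because the label is provider-level, this feature is either the provider's full claim count or 0, so it is effectively derived from the target.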

REASONING


Adding New Feature :: Total Number of claims or cases seen by Attending Physician
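As an illustration of a per-physician claim count, the sketch below counts claims per attending physician and writes the count back to every claim row. The column names `AttendingPhysician` and `Att_Phy_Tot_Claims` are assumed, and treating a missing physician as 0 claims is a choice made here, not something the notebook states.

```python
import pandas as pd

claims = pd.DataFrame({
    "ClaimID": ["C1", "C2", "C3", "C4"],
    "AttendingPhysician": ["PHY1", "PHY1", "PHY2", None],
})

# value_counts ignores missing physicians, so C4 maps to NaN below.
per_physician = claims["AttendingPhysician"].value_counts()

# Write each physician's claim count back onto their claims.
claims["Att_Phy_Tot_Claims"] = (claims["AttendingPhysician"]
                                .map(per_physician)
                                .fillna(0)
                                .astype(int))
print(claims)
```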

OBSERVATION

Adding New Feature :: Total Number of claims or cases seen by Operating Physician

OBSERVATION

Adding New Feature :: Total Number of claims or cases seen by Other Physician

OBSERVATION

OBSERVATION

Adding Combined Feature :: Att_Opr_Oth_Phy_Tot_Claims

OBSERVATION

Adding 3 New Features :: Prv_Tot_Att_Phy, Prv_Tot_Opr_Phy and Prv_Tot_Oth_Phy
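One plausible reading of these three features is the number of distinct physicians of each role associated with a provider. The sketch below shows that reading for `Prv_Tot_Att_Phy` only; the input column names and the interpretation itself are assumptions.

```python
import pandas as pd

claims = pd.DataFrame({
    "Provider": ["P1", "P1", "P1", "P2"],
    "AttendingPhysician": ["PHY1", "PHY2", "PHY1", "PHY3"],
})

# Distinct attending physicians per provider, repeated on each claim row.
claims["Prv_Tot_Att_Phy"] = (claims.groupby("Provider")["AttendingPhysician"]
                             .transform("nunique"))
print(claims)
```

The same pattern would apply to the operating and other physician columns.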

OBSERVATION

OBSERVATION

OBSERVATION

Adding Combined Feature :: Prv_Tot_Att_Opr_Oth_Phys

OBSERVATION

Adding New Feature :: Total Unique Claim Admit Codes used by a PROVIDER

OBSERVATION

NOTE :: What didn't work?

Adding New Feature :: Total Unique Number of Diagnosis Group Codes used by a PROVIDER

OBSERVATION

NOTE :: What didn't work?

NOTE :: What didn't work?

NOTE :: What didn't work?

Adding New Feature :: Total unique birth years of beneficiaries treated by a Provider

Read more at: https://economictimes.indiatimes.com/news/politics-and-nation/private-hospitals-perform-fake-surgeries-to-claim-thousands-in-insurance-cover/articleshow/16934229.cms?utm_source=contentofinterest&utm_medium=text&utm_campaign=cppst

OBSERVATION

Adding New Feature :: Sum of ages of patients treated by a Provider

OBSERVATION

Adding New Feature :: Sum of Insc Claim Re-Imb Amount for a Provider

OBSERVATION

Adding New Feature :: Total number of RKD Patients seen by a Provider

OBSERVATION

Exploratory Data Analysis

Let's find some trends

Q1. Which are the Top-25 Providers with maximum number of fraudulent cases?
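A Top-25 ranking like this one can be sketched with `value_counts`. The column names and the Yes/No label values are assumed, and the rows are toy data, so fewer than 25 providers appear.

```python
import pandas as pd

claims = pd.DataFrame({
    "Provider": ["P1", "P1", "P1", "P2", "P3", "P3"],
    "PotentialFraud": ["Yes", "Yes", "Yes", "Yes", "No", "No"],
})

# Keep only claims labeled as potentially fraudulent.
fraud_claims = claims[claims["PotentialFraud"] == "Yes"]

# value_counts sorts in descending order; head(25) keeps the top 25.
top_providers = fraud_claims["Provider"].value_counts().head(25)
print(top_providers)
```

Swapping the filter to `"No"` or the column to a physician, diagnosis code, or state column gives the other Top-25 questions in this section.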

OBSERVATION

Q2. Which are the Top-25 Providers with maximum number of non-fraudulent cases?

OBSERVATION

Q3. Which are the Top-25 Attending Physicians with maximum number of fraudulent cases?

OBSERVATION

Q4. Which are the Top-25 Attending Physicians with maximum number of non-fraudulent cases?

OBSERVATION

Q5. Which are the Top-25 Operating Physicians with maximum number of fraudulent cases?

OBSERVATION

Q6. Which are the Top-25 Operating Physicians with maximum number of non-fraudulent cases?

OBSERVATION

Q7. Which are the Top-25 Other Physicians with maximum number of fraudulent cases?

OBSERVATION

Q8. Which are the Top-25 Other Physicians with maximum number of non-fraudulent cases?

OBSERVATION

Q9. Which are the Top-25 ClmAdmitDiagnosisCode with maximum number of fraudulent cases?

OBSERVATION

Q10. Which are the Top-25 ClmAdmitDiagnosisCode with maximum number of non-fraudulent cases?

OBSERVATION

Q11. Which are the Top-25 DiagnosisGroupCode with maximum number of fraudulent cases?

OBSERVATION

Q12. Which are the Top-25 DiagnosisGroupCode with maximum number of non-fraudulent cases?

OBSERVATION

Q13. Do Age_groups have any relationship with maximum number of fraudulent cases?
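For the age-group questions, one way to proceed is to bin `Bene_Age` and cross-tabulate the bins against the label. The bin edges and labels below are hypothetical, chosen only to make the example readable.

```python
import pandas as pd

df = pd.DataFrame({
    "Bene_Age": [30, 67, 72, 85, 91],
    "PotentialFraud": ["No", "Yes", "No", "Yes", "Yes"],
})

# Hypothetical age bins; intervals are right-closed, e.g. (65, 80].
bins = [0, 40, 65, 80, 120]
labels = ["0-40", "41-65", "66-80", "81+"]
df["Age_group"] = pd.cut(df["Bene_Age"], bins=bins, labels=labels)

# Claim counts per age group, split by fraud label.
counts = pd.crosstab(df["Age_group"], df["PotentialFraud"])
print(counts)
```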

OBSERVATION

Q14. Do Age_groups have any relationship with maximum number of non-fraudulent cases?

OBSERVATION

Q15. Which are the Top-25 States with maximum number of fraudulent cases?

OBSERVATION

Q16. What are the Top-25 States with maximum number of non-fraudulent cases?

OBSERVATION

Q17. Which are the Top-25 Counties with maximum number of fraudulent cases?

OBSERVATION

Q18. Which are the Top-25 Counties with maximum number of non-fraudulent cases?

OBSERVATION

Q19. Do various Human Races have any relationship with maximum number of fraudulent cases?

OBSERVATION

Q20. Do various Human Races have any relationship with maximum number of non-fraudulent cases?

OBSERVATION